Abstract

We present a graphical approach to deriving inequality constraints for
directed acyclic graph (DAG) models, where some variables are unobserved.
In particular we show that the observed distribution of a discrete
model is always restricted if any two observed variables are neither
adjacent in the graph, nor share a latent parent; this generalizes the
well-known instrumental inequality. The method also provides inequalities on
interventional distributions, which can be used to bound causal effects. All
these constraints are characterized in terms of a new graphical separation
criterion, providing an easy and intuitive method for their derivation.



1 Introduction 

Models based on directed acyclic graphs (DAGs) are commonly used for causal
inference on account of their easily understood conditional independence
constraints, and the intuitive appeal of using arrows to display causal dependences.
If all the variables in a DAG are observed then causal quantities of interest are
typically point identified, and derivable in terms of conditional probabilities.
However, it is common for some variables to be unobservable, possibly
representing confounding factors which may bias inference; in this case we can only
observe the marginal distribution over the remaining variables.

The models which result from the marginalization of a DAG are much less
well understood and, unlike DAGs, are not described merely in terms of
conditional independence constraints. In particular, causal effects may not be point
identified, and we can only hope for inequality constraints describing the range
of possible values.

Existing methods for deriving bounds on observed distributions are either
specific to a particular model Pearl (1995); Balke and Pearl (1997), or
computationally intensive and lacking the intuitiveness of a graphical approach Bonet
(2001); Kang and Tian (2006). See Ramsahai (2012) for an approach which is
graphical in spirit, but uses computationally difficult variable elimination
methods. In this paper we take steps to remedy these problems by providing a simple
graphical separation criterion for determining the existence of constraints, and
for constructing them explicitly.

The remainder of the paper is organised as follows: §2 introduces DAGs and 
related terminology and notation. §3 gives a new method for deriving known 
constraints on the observed distribution of the instrumental variables model, 
and related causal effects. §4 applies these methods to give new constraints for 
general DAG models, and §5 contains examples. A discussion is found in §6, 
and longer proofs are in an appendix. 

2 Graphical Models 

A directed graph $\mathcal{G}$ is a set of vertices $\mathbf{V}$, with a collection of ordered pairs of
distinct vertices, or edges, $\mathcal{E}$. If $(X, Y) \in \mathcal{E}$ we write $X \to Y$, and say that $X$ is
a parent of $Y$. The set of parents of $Y$ is denoted $\mathrm{pa}_{\mathcal{G}}(Y)$. A path is a sequence
of adjacent edges in a graph, without repetition of vertices; for example, the
graph in Figure 1(a) contains the path $\pi_1 : Z \to X \leftarrow U \to Y$. A path is
directed from $X$ to $Y$ if all the arrows point away from $X$ and towards $Y$. If
there is a directed path from $X$ to $Y$ we say that $Y$ is a descendant of $X$, and
$X$ an ancestor of $Y$. A directed graph is acyclic if there is no directed path
from a vertex to itself; such an object is called a directed acyclic graph (DAG).

We associate each vertex $X$ with a random variable under some multivariate
distribution $P$; let $P$ admit a density $f$. For convenience, in what follows we
will use $X$ to denote both the vertex and the random variable, and similarly use
operators and bold face letters (e.g. $\mathrm{pa}_{\mathcal{G}}(X)$, $\mathbf{C}$) to refer to both a set of vertices
and the associated vector of random variables. The factorization criterion for
DAGs says that $P$ is in the model corresponding to the DAG $\mathcal{G}$ if the joint
density factorizes as $\prod_{v \in \mathbf{V}} f(v \mid \mathrm{pa}_{\mathcal{G}}(v))$.

Internal vertices on a path with two adjacent arrowheads are called colliders
on the path; other internal vertices are non-colliders. On the path $\pi_1$, $X$ is a
collider, and $U$ a non-collider. A path $\pi$ from $X$ to $Y$ is blocked given a set of
vertices $\mathbf{C}$ if there is a non-collider on $\pi$ in $\mathbf{C}$, or a collider on $\pi$ which is not
an ancestor of any vertex in $\mathbf{C}$ (each vertex being counted among its own ancestors).

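We say that two sets of vertices $\mathbf{A}$ and $\mathbf{B}$ are d-separated given a set of
vertices $\mathbf{C}$, if every path from any vertex in $\mathbf{A}$ to any vertex in $\mathbf{B}$ is blocked by
$\mathbf{C}$. A probability distribution $P$ obeys the global Markov property for a DAG $\mathcal{G}$
if whenever $\mathbf{A}$ and $\mathbf{B}$ are d-separated by $\mathbf{C}$ in $\mathcal{G}$, then $\mathbf{A} \perp\!\!\!\perp \mathbf{B} \mid \mathbf{C}\ [P]$.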

It is well known that d-separation is equivalent to the factorization criterion 
Verma and Pearl (1988). In particular, all constraints implied by a DAG on fully 
observed random variables can be interpreted as conditional independences. 
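
The d-separation criterion can be checked mechanically. The following is a minimal sketch, not taken from any cited work: it assumes a DAG stored as a dictionary mapping each vertex to a list of its parents, and uses the standard equivalent formulation in which disjoint sets A and B are d-separated by C exactly when C separates them in the moralized graph of the ancestors of A, B and C. The function names are hypothetical.

```python
from itertools import combinations


def ancestors(parents, nodes):
    """Return `nodes` together with every ancestor of a vertex in `nodes`."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        v = stack.pop()
        for p in parents.get(v, ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def d_separated(parents, A, B, C):
    """True if the disjoint vertex sets A and B are d-separated given C."""
    A, B, C = set(A), set(B), set(C)
    keep = ancestors(parents, A | B | C)
    # Moralize the ancestral subgraph: join each vertex to its parents,
    # and join every pair of parents that share a child.
    nbrs = {v: set() for v in keep}
    for v in keep:
        pa = list(parents.get(v, ()))
        for p in pa:
            nbrs[v].add(p)
            nbrs[p].add(v)
        for p, q in combinations(pa, 2):
            nbrs[p].add(q)
            nbrs[q].add(p)
    # Undirected search from A that never enters C.
    seen, stack = set(A), list(A)
    while stack:
        v = stack.pop()
        if v in B:
            return False
        for w in nbrs[v] - C - seen:
            seen.add(w)
            stack.append(w)
    return True


# Figure 1(a): Z -> X <- U -> Y together with X -> Y.
iv = {"Z": [], "U": [], "X": ["Z", "U"], "Y": ["X", "U"]}
# Figure 1(b): the same graph with the edge X -> Y removed.
iv_cut = {"Z": [], "U": [], "X": ["Z", "U"], "Y": ["U"]}

print(d_separated(iv, {"Z"}, {"Y"}, set()))      # the path Z -> X -> Y is open
print(d_separated(iv, {"Z"}, {"U"}, {"X"}))      # conditioning on the collider X
print(d_separated(iv_cut, {"Z"}, {"Y"}, set()))  # Z and Y are separated here
```

In the first two calls a path is open, so the function returns False; in the graph of Figure 1(b) the only path from Z to Y passes through the collider X, and the function returns True.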

Assigning a causal interpretation to a DAG model requires extra assumptions,
in particular that the system under observation is stable under interventions
with respect to the graph. We will denote an intervention to fix $X = x$ by
$do(X = x)$, or $do(x)$ for short; graphically this may be represented by removing
the edges of the form $Y \to X$, so that $X$ has no parents in the new graph. The
density $f(\mathbf{V} \mid do(x))$ is given by dividing the joint density $f$ by $f(x \mid \mathrm{pa}_{\mathcal{G}}(X))$
and multiplying by the indicator function $\mathbb{1}_{\{X = x\}}$. See Pearl (2009) for details.

Figure 1: (a) The instrumental variables (IV) model; U is unobserved. (b) The
IV model with the effect of X on Y removed.

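If some of the variables in a DAG are unobserved, we may be interested in
the implications of the underlying graph for the observable margin. Let $\mathbf{U} \subseteq \mathbf{V}$
denote the set of latent or unobservable vertices; the observable margin is then

$$\int_{\mathbf{U}} \prod_{v \in \mathbf{V}} f(v \mid \mathrm{pa}_{\mathcal{G}}(v)) \, d\mathbf{U}. \qquad (1)$$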

The marginal distribution over the observed variables is completely identifiable,
but some of the structure of the underlying graph may be impossible to
determine in the presence of latent variables. We will make no assumption about the
state space of the latent variables, since these are unobserved. Some conditional
independences may still be observable, but other kinds of constraint also arise,
including Verma constraints Verma and Pearl (1991), and inequalities on the
observed distribution (see next section).

Without loss of generality we will assume that none of the latent variables 
have any parents. 
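
To make the marginalization concrete, the sketch below computes the observable conditional distribution for the graph of Figure 1(a), in which the joint density factorizes as f(u) p(z) p(x | z, u) p(y | x, u). Purely for illustration it assumes that the latent variable takes finitely many values, an assumption the text does not make; the function name and the nested-list layout are hypothetical.

```python
def iv_margin(f_u, p_x_zu, p_y_xu):
    """Observable margin p[z][x][y] = sum_u f(u) p(x | z, u) p(y | x, u).

    Hypothetical layout: f_u[u], p_x_zu[z][u][x], p_y_xu[x][u][y].
    """
    n_u = len(f_u)
    n_z = len(p_x_zu)
    n_x = len(p_y_xu)
    n_y = len(p_y_xu[0][0])
    return [[[sum(f_u[u] * p_x_zu[z][u][x] * p_y_xu[x][u][y] for u in range(n_u))
              for y in range(n_y)]
             for x in range(n_x)]
            for z in range(n_z)]


# Toy binary example: X ignores both Z and U, while Y simply copies the latent U.
f_u = [0.5, 0.5]
p_x_zu = [[[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]]
p_y_xu = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
p = iv_margin(f_u, p_x_zu, p_y_xu)
print(p)
```

Only the resulting table p(x, y | z) is available to the analyst; the three inputs are not.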

3 Instrumental Variables 

Perhaps the most thoroughly studied causal DAG model is the instrumental 
variables model, represented in Figure 1(a). It arises naturally in randomized 
trials with imperfect compliance, in which Z represents a randomized treatment 
assignment, X the treatment actually taken by the subject, and Y an outcome; 
U represents unmeasured confounding factors which may affect both the
probability of the subject taking the treatment and the outcome of interest, so that
naive estimators of the effect of X on Y will be biased. 

The graph encodes (amongst other assumptions) that the assignment Z does 
not affect the outcome Y other than through the treatment X. This is known 
as the exclusion restriction, and is important for assessing the effect of X on Y; 
implications of the exclusion restriction which can be subjected to an empirical 
test are therefore very useful. 

If no assumptions are made about the character of $U$, and $X$ is continuous,
the observable margin is unconstrained Bonet (2001). However, if the observed
variables have finite and discrete state spaces, then the observed distribution
obeys the instrumental inequality of Pearl (1995):

$$\max_{x} \sum_{y} \max_{z} \, p(x, y \mid z) \leq 1; \qquad (2)$$



here $p(x, y \mid z)$ is used to denote $P(X = x, Y = y \mid Z = z)$. This restriction can
be used to falsify the IV model. Pearl's proof of the inequality is model specific, 
and it is not clear how it might be applied to other graphs. Below we present 
a new approach to the derivation of (2), and a more graphical interpretation of 
its meaning; as we shall see, this method can be adapted to many other DAG 
models, and provides some causal constraints. 
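
Checking the instrumental inequality (2) against an observed table of conditional probabilities is a one-line computation. The sketch below is illustrative only; it assumes the probabilities are stored as a nested list p[z][x][y] = P(X = x, Y = y | Z = z), and the function names and the numerical tolerance are hypothetical choices.

```python
def instrumental_statistic(p):
    """Return max_x sum_y max_z p(x, y | z), where p[z][x][y] = P(X=x, Y=y | Z=z)."""
    n_z, n_x, n_y = len(p), len(p[0]), len(p[0][0])
    return max(
        sum(max(p[z][x][y] for z in range(n_z)) for y in range(n_y))
        for x in range(n_x)
    )


def satisfies_instrumental_inequality(p, tol=1e-12):
    """True if the left-hand side of (2) does not exceed 1 (up to `tol`)."""
    return instrumental_statistic(p) <= 1.0 + tol


# All four (x, y) cells equally likely, whatever the value of Z.
uniform = [[[0.25, 0.25], [0.25, 0.25]], [[0.25, 0.25], [0.25, 0.25]]]
# X is always 0 but Y copies Z, so Z affects Y without passing through X.
copy_z = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 1.0], [0.0, 0.0]]]

print(instrumental_statistic(uniform), satisfies_instrumental_inequality(uniform))
print(instrumental_statistic(copy_z), satisfies_instrumental_inequality(copy_z))
```

The second table violates (2), so it cannot arise from the IV model whatever the distribution of the latent variable.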

Proposition 3.1. Let $P$ be a probability distribution over three random
variables $Z$, $X$ and $Y$, taking values in discrete sets $\mathcal{Z}$, $\mathcal{X}$ and $\mathcal{Y}$ respectively. Then
$P$ obeys the IV model only if for each $\xi \in \mathcal{X}$, the collection of conditional
probabilities $(p(\xi, y \mid z),\ y \in \mathcal{Y},\ z \in \mathcal{Z})$ is compatible with a distribution under which

$$Y \perp\!\!\!\perp Z.$$

In other words, only if for each $\xi \in \mathcal{X}$ there exists a distribution $P^*$ such
that $Y \perp\!\!\!\perp Z\ [P^*]$ and $p^*(\xi, y \mid z) = p(\xi, y \mid z)$ for each $y \in \mathcal{Y}$ and $z \in \mathcal{Z}$.
This condition implies the instrumental inequality (2).

Proof. Suppose that $P$ is in the IV model. Then

$$p(x, y \mid z) = \int f(u) \, p(x \mid z, u) \, p(y \mid x, u) \, du;$$

construct a distribution $P^*$ by

$$p^*(x, y \mid z) = \int f(u) \, p(x \mid z, u) \, p(y \mid \xi, u) \, du.$$

Under $P^*$, the effect of $X$ on $Y$ has been broken, because $Y$ behaves as though
$X = \xi$ regardless of its actual value. $P^*$ obeys the factorization criterion with
respect to the graph in Figure 1(b); thus $Y \perp\!\!\!\perp Z\ [P^*]$, and by construction
$p^*(\xi, y \mid z) = p(\xi, y \mid z)$ for each $y \in \mathcal{Y}$ and $z \in \mathcal{Z}$.
To see that this implies (2), first note that the independence is equivalent to

$$\begin{aligned}
p^*(y \mid z) &= p^*(y \mid z') \\
p(\xi, y \mid z) + \sum_{x \neq \xi} p^*(x, y \mid z) &= p(\xi, y \mid z') + \sum_{x \neq \xi} p^*(x, y \mid z')
\end{aligned}$$

for each $y \in \mathcal{Y}$, $z, z' \in \mathcal{Z}$. Suppose we are given the probabilities $p(\xi, y \mid z)$ and
asked to construct a distribution $P^*$ satisfying these equations. Since all the
quantities are non-negative, and this equality holds for each $z'$, we have

$$\max_{z'} p(\xi, y \mid z') - p(\xi, y \mid z) \leq \sum_{x \neq \xi} p^*(x, y \mid z).$$

However the sum of the quantities on the RHS over $y$ cannot be greater than
$1 - p(\xi \mid z) = 1 - \sum_y p(\xi, y \mid z)$, so

$$\begin{aligned}
\sum_y \Big( \max_{z'} p(\xi, y \mid z') - p(\xi, y \mid z) \Big) &\leq 1 - \sum_y p(\xi, y \mid z) \\
\sum_y \max_{z'} p(\xi, y \mid z') &\leq 1.
\end{aligned}$$

Applying this to each $\xi$ gives (2). □

Figure 2: The model from Figure 1(a) with a possible effect of Z on Y added.

Remark 3.2. Whilst these inequalities are not new, the importance of the above 
result lies in the proof technique; we will see in the next section that it generalizes 
to many other DAG models, giving novel results. 

The instrumental inequality is exact when $X$, $Y$ and $Z$ are binary, but
insufficient if $Z$ takes three states Bonet (2001). The sufficient bounds are difficult
to derive without using computationally intensive linear programming techniques
and Fourier-Motzkin elimination, which become infeasible for moderately sized
state spaces.

3.1 Causal bounds on the IV model 

We next try to invert the problem and ask how much effect Z can have on Y 
given the observed distribution. In some sense we are trying to quantify the 
strength of the dashed arrow in Figure 2. A suitable measure is the average 
controlled direct effect (ACDE) of Z on Y, controlling for X = x; this is defined 
for binary Z, X and Y as 

$$\mathrm{ACDE}_{Z \to Y}(x) = p(y_1 \mid do(z_1, x)) - p(y_1 \mid do(z_0, x)).$$

Here $y_1$ is a shorthand for $\{Y = 1\}$, whilst $x$ means $\{X = x\}$, etc. Generalizations
to non-binary state spaces are also possible Cai et al. (2008). Note that
$\mathrm{ACDE}_{Z \to Y}(x) = 0$ for each $x$ if $Z \not\to Y$. For the DAG in Figure 2,

$$p(y \mid do(z, x)) = \int f(u) \, p(y \mid x, z, u) \, du,$$

which is not identified. However, constructing $P^*$ as above,

$$\begin{aligned}
p(y \mid do(z, \xi)) &= \int f(u) \, p^*(y \mid z, u) \, du \\
&= p^*(y \mid z) \\
&= p(y, \xi \mid z) + \sum_{x \neq \xi} p^*(y, x \mid z) \\
&\leq p(y, \xi \mid z) + 1 - p(\xi \mid z).
\end{aligned}$$

Also $p(y \mid do(z, \xi)) \geq p(y, \xi \mid z)$, so

$$\begin{aligned}
\mathrm{ACDE}_{Z \to Y}(x) &\leq p(y_1, x \mid z_1) + 1 - p(x \mid z_1) - p(y_1, x \mid z_0) \\
&= 1 - p(y_0, x \mid z_1) - p(y_1, x \mid z_0),
\end{aligned}$$

and similarly

$$\mathrm{ACDE}_{Z \to Y}(x) \geq p(y_1, x \mid z_1) + p(y_0, x \mid z_0) - 1.$$

Note that the ACDE bounds include zero if and only if the instrumental
inequality (2) is satisfied. These bounds were derived by Cai et al. (2008) using
linear programming, and shown to be tight. In the next section we will extend 
this method to other graphs. 
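
The two ACDE bounds for binary Z, X and Y are direct arithmetic on the observed table. A minimal sketch follows, again assuming the hypothetical layout p[z][x][y] = P(X = x, Y = y | Z = z) with states coded 0 and 1; the function name is not from the text.

```python
def acde_bounds(p, x):
    """Return (lower, upper) bounds on ACDE_{Z->Y}(x) for binary Z, X and Y.

    lower = p(y1, x | z1) + p(y0, x | z0) - 1
    upper = 1 - p(y0, x | z1) - p(y1, x | z0)
    """
    lower = p[1][x][1] + p[0][x][0] - 1.0
    upper = 1.0 - p[1][x][0] - p[0][x][1]
    return lower, upper


# All four (x, y) cells equally likely, whatever the value of Z.
uniform = [[[0.25, 0.25], [0.25, 0.25]], [[0.25, 0.25], [0.25, 0.25]]]
# X is always 0 but Y copies Z.
copy_z = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 1.0], [0.0, 0.0]]]

print(acde_bounds(uniform, 0))  # an interval containing zero
print(acde_bounds(copy_z, 0))   # an interval excluding zero
```

For the second table the interval excludes zero, in line with the fact that this table also violates the instrumental inequality.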

4 Other Models 

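Just as d-separation provides a graphical criterion for finding observable
conditional independences, we now provide a graphical criterion for finding observable
inequality constraints. For a DAG $\mathcal{G}$ with vertex set $\mathbf{V}$ and edge set $\mathcal{E}$, define
the induced subgraph $\mathcal{G}_{\mathbf{W}}$ for $\mathbf{W} \subseteq \mathbf{V}$ as the DAG with vertex set $\mathbf{W}$ and edge
set $\mathcal{E} \cap (\mathbf{W} \times \mathbf{W})$.

Now we define our new separation criterion: let $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$ and $\mathbf{D}$ be disjoint
sets of observed vertices. $\mathbf{A}$ and $\mathbf{B}$ are e-separated (extended d-separation)
given $\mathbf{C}$ after deletion of $\mathbf{D}$ in $\mathcal{G}$, if $\mathbf{A}$ and $\mathbf{B}$ are d-separated by $\mathbf{C}$ in $\mathcal{G}_{\mathbf{V} \setminus \mathbf{D}}$.
In other words, if we remove the vertices in $\mathbf{D}$ from the graph, then $\mathbf{A}$ and $\mathbf{B}$
are d-separated by $\mathbf{C}$.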

For example, in the graph in Figure 1(a), Z and Y are e-separated after 
deletion of X. The following lemma gives an alternative characterization of 
e-separation which will prove useful. Its proof is elementary, and omitted for 
brevity. 

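Lemma 4.1. Let $\mathcal{G}$ be a DAG, and let $\mathcal{G}^*$ be the DAG formed from $\mathcal{G}$ by removing
all edges which are oriented away from some vertex in $\mathbf{D}$ (i.e. of the form
$D \to E$ for $D \in \mathbf{D}$). Then $\mathbf{A}$ is e-separated from $\mathbf{B}$ by $\mathbf{C}$ after deletion of $\mathbf{D}$
in $\mathcal{G}$ if and only if $\mathbf{A}$ is d-separated from $\mathbf{B}$ by $\mathbf{C}$ in $\mathcal{G}^*$.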

Graphs formed by removing the edges emanating from vertices form a part
of Pearl's do-calculus Pearl (2009). The node-splitting method in Robins et al.
(2006) is also related.
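
Both characterizations of e-separation, deleting the vertices in D and cutting the edges that leave D as in Lemma 4.1, reduce to an ordinary d-separation query on a modified graph. The sketch below illustrates this on the IV graph; it assumes a DAG stored as a dictionary from each vertex to a list of its parents, tests d-separation through the moralized ancestral graph, and uses hypothetical function names throughout.

```python
from itertools import combinations


def d_separated(parents, A, B, C):
    """d-separation of disjoint sets A and B given C, via the moralized ancestral graph."""
    A, B, C = set(A), set(B), set(C)
    keep, stack = set(A | B | C), list(A | B | C)
    while stack:  # close the vertex set under taking parents
        for p in parents[stack.pop()]:
            if p not in keep:
                keep.add(p)
                stack.append(p)
    nbrs = {v: set() for v in keep}
    for v in keep:
        pa = list(parents[v])
        for p in pa:  # parent-child edges, made undirected
            nbrs[v].add(p)
            nbrs[p].add(v)
        for p, q in combinations(pa, 2):  # marry parents of a common child
            nbrs[p].add(q)
            nbrs[q].add(p)
    seen, stack = set(A), list(A)
    while stack:  # search from A without entering C
        v = stack.pop()
        if v in B:
            return False
        for w in nbrs[v] - C - seen:
            seen.add(w)
            stack.append(w)
    return True


def delete_vertices(parents, D):
    """Induced subgraph obtained by removing the vertices in D."""
    return {v: [p for p in pa if p not in D] for v, pa in parents.items() if v not in D}


def cut_outgoing(parents, D):
    """The graph of Lemma 4.1: drop every edge pointing away from a vertex in D."""
    return {v: [p for p in pa if p not in D] for v, pa in parents.items()}


def e_separated(parents, A, B, C, D):
    """e-separation of A and B given C after deletion of D."""
    return d_separated(delete_vertices(parents, D), A, B, C)


# Figure 1(a): Z -> X <- U -> Y together with X -> Y.
iv = {"Z": [], "U": [], "X": ["Z", "U"], "Y": ["X", "U"]}
print(e_separated(iv, {"Z"}, {"Y"}, set(), {"X"}))
print(d_separated(cut_outgoing(iv, {"X"}), {"Z"}, {"Y"}, set()))
```

On this graph the two queries agree, as the lemma asserts, while Z and Y are not d-separated in the original graph.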

Suppose now that we are interested in detecting the presence or absence
of the edge $X \to Y$ in a general graph, and in estimating the strength of the
(direct) causal effect of $X$ on $Y$. We first show that if $X$ and $Y$ are not directly
confounded with each other, which is to say that they do not share a latent
parent, then falsifiable constraints (such as the instrumental inequality) for the
absence of the edge $X \to Y$ always exist.

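Theorem 4.2. Let $\mathcal{G}$ be a DAG, and let $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$ and $\mathbf{D}$ be disjoint sets of
observable vertices such that no vertex in $\mathbf{C}$ is a descendant of any in $\mathbf{D}$. If $\mathbf{A}$
is e-separated from $\mathbf{B}$ by $\mathbf{C}$ after deletion of $\mathbf{D}$, then for any fixed value $\mathbf{D} = d$,
the conditional probabilities $p(a, b, d \mid c)$ must be compatible with a distribution
$P^*$ in which $\mathbf{A} \perp\!\!\!\perp \mathbf{B} \mid \mathbf{C}\ [P^*]$.

If in addition no vertex in $\mathbf{A}$ is a descendant of any element of $\mathbf{D}$, then
the probabilities $p(b, d \mid a, c)$ must be compatible with a distribution $P^*$ in which
$\mathbf{A} \perp\!\!\!\perp \mathbf{B} \mid \mathbf{C}\ [P^*]$.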

Corollary 4.3. Let $\mathcal{G}$ be a DAG containing observable vertices $X$, $Y$, which do
not share a latent parent nor are joined by an edge; let $\mathcal{G}'$ be equal to $\mathcal{G}$, except
that $X \to Y$ in $\mathcal{G}'$. Then if the observed variables in the graphs are discrete, the
model defined by the observed margin of $\mathcal{G}'$ is strictly larger than the one defined
by $\mathcal{G}$.

Proof. Under the conditions given, we can apply Theorem 4.2 to $\mathcal{G}$ with
$\mathbf{A} = \{X\}$, $\mathbf{B} = \{Y\}$, $\mathbf{C} = \emptyset$ and $\mathbf{D} = \mathbf{V} \setminus (\mathbf{U} \cup \{X, Y\})$.

To see that this implies a constraint, consider a distribution in which all
vertices other than $X$ and $Y$ are completely independent, and
$P(\mathbf{D} = d) = 1 - \epsilon$ for some arbitrarily small $\epsilon > 0$. Then
$P(X, Y, \mathbf{D} \mid \mathbf{C}) \approx P(X, Y \mid \mathbf{C}) = P(X, Y)$,
and if $X$ and $Y$ are strongly correlated, it becomes impossible to find
a compatible distribution under which $X \perp\!\!\!\perp Y \mid \mathbf{C}$. However, since the only
dependence is between $X$ and $Y$, such a distribution would certainly obey the
global Markov property with respect to $\mathcal{G}'$, which contains the edge $X \to Y$. □

Remark 4.4. In other words, the Corollary states there exists some non-trivial
(i.e. falsifiable) condition on the joint distribution which must be satisfied under
$\mathcal{G}$, but not necessarily under $\mathcal{G}'$. In many cases we can choose smaller sets $\mathbf{D}$
than the one used in the proof of Corollary 4.3; the generated inequalities will
tend to be more powerful if $\mathbf{D}$ is smaller, so certainly a minimal set should be
used.

It is important to stress that this result is not a causal one, and the
constraints are merely a consequence of marginalizing distributions obeying certain
conditional independence constraints. In the next subsection, however, we will 
extend this method to estimate the strength of causal relationships. 

In the IV graph in Figure 1(a), Z is e-separated from Y after deletion of X, 
giving an inequality constraint. In general, the additional constraint implied by 
the Theorem may be an inequality or a conditional independence (if $\mathbf{D} = \emptyset$); an
inequality constructed will in some cases be a weaker manifestation of a Verma 
constraint, or possibly some other as yet unknown form of equality constraint. 
Verma constraints are still poorly understood; see Tian and Pearl (2002) for 
methods on deriving them. 

Remark 4.5. Theorem 4.2 can be extended to continuous state spaces without
difficulty, but it is necessary for the set D to contain only discrete variables. The 
IV model from Figure 1(a) with continuous X is unconstrained, for example. 

4.1 Causal Bounds 

As with the IV model, we can find bounds on the average controlled direct effect
due to the edge $X \to Y$ in arbitrary models, so long as $X$ and $Y$ are not directly
confounded. First we generalize the average controlled direct effect slightly to
allow conditioning:

$$\mathrm{ACDE}_{X \to Y}(d \mid c) = p(y_1 \mid do(x_1, d), c) - p(y_1 \mid do(x_0, d), c).$$

In general $\mathrm{ACDE}_{X \to Y}(d \mid c) \neq \mathrm{ACDE}_{X \to Y}(d, c)$, but for appropriate graphs if
$\mathrm{ACDE}_{X \to Y}(d \mid c) \neq 0$, then $X \to Y$.

Theorem 4.6. Let $\mathcal{G}$ be a DAG containing the edge $X \to Y$ and observable
sets of vertices $\mathbf{C}$, $\mathbf{D}$ such that no vertex in $\mathbf{C}$ is a descendant of one in $\mathbf{D}$.
Suppose further that if the edge $X \to Y$ is removed, $X$ is e-separated from $Y$ by
$\mathbf{C}$ after deletion of $\mathbf{D}$.
Let

$$\begin{aligned}
L(x, y, d \mid c) &= \max\left\{ 0, \frac{p(x, y, d \mid c)}{p(x, d \mid c) + 1 - p(d \mid c)} \right\} \\
U(x, y, d \mid c) &= \min\left\{ \frac{p(x, y, d \mid c) + 1 - p(d \mid c)}{p(x, d \mid c) + 1 - p(d \mid c)}, 1 \right\}.
\end{aligned}$$

Then

$$L(x, y, d \mid c) \leq p(y \mid do(x, d), c) \leq U(x, y, d \mid c)$$

and consequently for binary $X$ and $Y$,

$$L(x_1, y_1, d \mid c) - U(x_0, y_1, d \mid c) \leq \mathrm{ACDE}_{X \to Y}(d \mid c) \leq U(x_1, y_1, d \mid c) - L(x_0, y_1, d \mid c).$$

If in addition $X$ is not a descendant of any vertex in $\mathbf{D}$, these inequalities can
be strengthened using

$$\begin{aligned}
L(x, y, d \mid c) &= p(y, d \mid x, c) \\
U(x, y, d \mid c) &= p(y, d \mid x, c) + 1 - p(d \mid x, c).
\end{aligned}$$

Proof. See appendix. □

Figure 3: The unrelated confounding (UC) model.

Figure 4: (a) A DAG with three independent unobserved variables; we have
avoided explicitly drawing a vertex for each of the three unobserved variables,
and instead use a bidirected ($\leftrightarrow$) edge to indicate its two (observed) children.
(b) The same graph after deletion of W.

Remark 4.7. This result shows that we can always bound the effect
corresponding to a directed edge, at least for some observed distributions, provided the two
variables involved are not directly confounded with one another. The bounds
for the ACDE include zero if the compatibility requirement from Theorem 4.2
is satisfied. If they exclude zero, then the edge $X \to Y$ must be present in the
graph (given the other assumptions).
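
Once suitable sets C and D have been found, the bounds of Theorem 4.6 require only a few observed probabilities. The sketch below is illustrative: it takes the already-computed quantities p(x, y, d | c), p(x, d | c) and p(d | c) as plain numbers, follows the ratio derived in the appendix proof, and assumes the denominator p(x, d | c) + 1 - p(d | c) is positive. All function and argument names are hypothetical.

```python
def do_bounds(p_xyd, p_xd, p_d):
    """Bounds (L, U) on p(y | do(x, d), c) from p(x,y,d|c), p(x,d|c) and p(d|c).

    Assumes p_xd + 1 - p_d > 0, so that the ratio is well defined.
    """
    denom = p_xd + 1.0 - p_d
    lower = max(0.0, p_xyd / denom)
    upper = min((p_xyd + 1.0 - p_d) / denom, 1.0)
    return lower, upper


def do_bounds_strong(p_yd_given_x, p_d_given_x):
    """Strengthened bounds, valid when X is not a descendant of any vertex in D.

    Inputs are p(y, d | x, c) and p(d | x, c).
    """
    return p_yd_given_x, p_yd_given_x + 1.0 - p_d_given_x


def acde_bounds(bounds_x1, bounds_x0):
    """Combine (L, U) pairs for x1 and x0 (both for y1) into bounds on the ACDE."""
    (l1, u1), (l0, u0) = bounds_x1, bounds_x0
    return l1 - u0, u1 - l0


b = do_bounds(p_xyd=0.25, p_xd=0.5, p_d=0.5)
print(b)
print(do_bounds_strong(0.5, 0.5))
print(acde_bounds(b, b))
```

With p(d | c) = 0.5 half of the probability mass is unconstrained by the data, which is why the resulting intervals are wide; the bounds tighten as p(d | c) approaches one.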

5 Examples 

The graph in Figure 3, which we refer to as the unrelated confounding (UC)
model, has no edge between $Z$ and $Y$, and nor are these two variables directly
confounded. Theorem 4.2 and Corollary 4.3 therefore tell us that in the
discrete case, the joint distribution of $(X, Y, Z)$ is restricted, and in particular that
for each $\xi$, the joint probabilities $p(\xi, y, z)$ must be compatible with a
distribution in which $Z \perp\!\!\!\perp Y$. Let $p_{ijk} = p(x_i, y_j, z_k)$; in the binary case, given
$p_{000}, p_{010}, p_{001}, p_{011}$, we need to find non-negative $p^*_{100}, p^*_{110}, p^*_{101}, p^*_{111}$ such that

$$(p_{000} + p^*_{100})(p_{011} + p^*_{111}) = (p_{010} + p^*_{110})(p_{001} + p^*_{101})$$

and $\sum_{jk} p_{0jk} + \sum_{jk} p^*_{1jk} = 1$. This will not be possible if, for example, $p_{000}$ and
$p_{011}$ are both large; that is, we cannot have both $P(X = \xi)$ be large and $Z$ and
$Y$ strongly correlated conditional on $X = \xi$. Unlike in the IV model we cannot
apply the stronger condition of Theorem 4.2, because $Z$ is a descendant of $X$.
We remark that (observationally) the UC model strictly contains the IV model
in Figure 1(a). Note that a linear programming approach to finding constraints
on this graph is not possible, so the constructive nature of the proof of Theorem
4.2 is crucial in determining how we can test this model.
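
For the binary case of this example, the existence of suitable non-negative values can be decided in closed form. The sketch below is one possible reduction and is not a procedure given in the text. Writing m for the probability mass not accounted for by the four given values, the difference between the two sides of the product equation varies continuously as m is distributed over the four unknown cells; it is largest when all of m goes to the cells paired with p000 and p011, and smallest when all of m goes to the other two. A solution therefore exists exactly when zero lies between these two extremes. The function names are hypothetical.

```python
def _max_product(u, v, m):
    """Maximum of (u + s) * (v + m - s) over 0 <= s <= m."""
    s = min(max((m + v - u) / 2.0, 0.0), m)  # stationary point, clipped to [0, m]
    return (u + s) * (v + m - s)


def uc_compatible(p000, p010, p001, p011):
    """True if non-negative p*_{1jk} with total mass m = 1 - sum(p_{0jk}) exist
    such that the completed 2x2 table over (y, z) has zero determinant."""
    m = 1.0 - (p000 + p010 + p001 + p011)
    # Largest determinant: put all of m on the (0,0) and (1,1) cells.
    det_max = _max_product(p000, p011, m) - p010 * p001
    # Smallest determinant: put all of m on the (1,0) and (0,1) cells.
    det_min = p000 * p011 - _max_product(p010, p001, m)
    return det_min <= 0.0 <= det_max


print(uc_compatible(0.45, 0.0, 0.0, 0.45))        # p000 and p011 both large
print(uc_compatible(0.125, 0.125, 0.125, 0.125))  # Y and Z independent given X = 0
```

The first call reproduces the situation described above: with p000 and p011 both equal to 0.45 only a tenth of the mass is free, which is too little to restore independence. Comparisons at the exact boundary are subject to floating-point rounding.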

The graph in Figure 4(a) is constrained in the discrete case because there is 
no edge between X and Y. Specifically X is e-separated from Y given W after 
deletion of Z, and also given Z after deletion of W (the latter being illustrated 
in Figure 4(b)). Note that X is a descendant of W, but not of Z, so the bounds 
given by Theorem 4.6 are not symmetric in the two cases. For example: 

$$\begin{aligned}
p(y \mid do(x, w), z) &\leq p(y, w \mid x, z) + 1 - p(w \mid x, z) \\
p(y \mid do(x, z), w) &\leq \frac{p(x, y, z \mid w) + 1 - p(z \mid w)}{p(x, z \mid w) + 1 - p(z \mid w)}.
\end{aligned}$$

The first bound is likely to be stronger, though this will not hold in all cases. 



6 Discussion 

We have presented a graphical approach to finding inequality constraints in 
distributions corresponding to marginalized DAGs, based on the e-separation 
criterion. It can be shown that the bounds derived from the algorithm of Kang
and Tian (2006) also imply the causal constraints given in Theorem 4.6; however,
that approach involves listing exponentially many inequalities and then using
Fourier-Motzkin elimination to derive bounds. For even modestly sized graphs
this becomes infeasible because Fourier-Motzkin is doubly-exponential in the 
number of variables in the elimination. 

The advantage of the results given above is that they are 'off the shelf', in the
sense that we need only check the conditions of the Theorems and then apply the 
results. Exhaustively searching possible sets C and D would be computationally 
intensive, but in many cases it is likely that good heuristics could be obtained 
for their selection. This could be highly advantageous in systems with large 
numbers of variables, especially during computationally intensive model search 
procedures. A further benefit of the e-separation criterion is that it is much 
easier and more intuitive for a human user to apply than using the algorithm of 
Kang and Tian (2006). 

The bounds derived from Theorem 4.2 are known not to be tight in some 
cases, including the IV model when the instrument takes three or more states. 
However, finding constraints from marginalized models is computationally
intensive, even if the inequalities are linear, so a fast method for finding a subset of
conditions may be very useful in practice. 

A Proofs



Proof of Theorem 4.2. By the global Markov property for DAGs, the joint
distribution $P$ over the observed variables takes the form (1). Now, for each
factor $f(v \mid \mathrm{pa}_{\mathcal{G}}(v))$, construct a new conditional density $f^*(v \mid \mathrm{pa}^*(v))$ where
$\mathrm{pa}^*(v) = \mathrm{pa}_{\mathcal{G}}(v) \setminus \mathbf{D}$, by fixing any element of $\mathbf{D} \cap \mathrm{pa}_{\mathcal{G}}(v)$ to the value
specified by $\mathbf{D} = d$. Note we only fix elements in the conditioning set, so $f^*$ is still
a valid conditional density.

Then the joint distribution $P^*$ given by

$$\int_{\mathbf{U}} \prod_{v \in \mathbf{V}} f^*(v \mid \mathrm{pa}^*(v)) \, d\mathbf{U}$$

factorizes according to the DAG $\mathcal{G}^*$ formed by removing any edges in $\mathcal{G}$ which
originate in $\mathbf{D}$ (i.e. the non-arrowhead end is incident to a vertex in $\mathbf{D}$). By
Lemma 4.1, $\mathbf{A}$ and $\mathbf{B}$ are d-separated by $\mathbf{C}$ in $\mathcal{G}^*$, and therefore the global
Markov property for DAGs says that $\mathbf{A} \perp\!\!\!\perp \mathbf{B} \mid \mathbf{C}\ [P^*]$. Further,
$P(a, b, c, d) = P^*(a, b, c, d)$ for the fixed $\mathbf{D} = d$ and any $a, b, c$, and $P^*(c) = P(c)$ because
the distribution of vertices ordered before $\mathbf{D}$ will be unchanged. This gives the
compatibility condition. If $\mathbf{A}$ is also ordered before $\mathbf{D}$ then $P^*(a, c) = P(a, c)$,
giving the stronger condition. □

Proof of Theorem 4.6. For simplicity we will assume $\mathbf{C} = \emptyset$, but the extension
to the general case is easy. Let $P^*$ be the distribution formed by fixing $\mathbf{D} = d$
in conditioning sets in the factorization of $P$, as in the proof of Theorem 4.2.
Then

$$p(y \mid do(x, d)) = p^*(y \mid x) = \frac{p^*(y, x)}{p^*(x)} = \frac{p(y, x, d) + \sum_{d' \neq d} p^*(y, x, d')}{p(x, d) + \sum_{d' \neq d} p^*(x, d')}.$$

Clearly $\sum_{d' \neq d} p^*(y, x, d') \leq \sum_{d' \neq d} p^*(x, d') \leq 1 - p(d)$; the expression is
maximized by both these sums taking their largest possible values, and minimized
when the first is zero and the second is $1 - p(d)$. This gives the main result.
If $X$ is not a descendant of $\mathbf{D}$ we have $p^*(x) = p(x)$, and arrive at the tighter
bounds by a similar analysis. □